¡Buen día Juan! Como te va? Espero que todo vaya bien!
Mi nombre es Emanuel Fitta! Un gusto conocerte, seré tu revisor en este proyecto.
A continuación un poco sobre la modalidad de revisión que usaremos:
Cuando enccuentro un error por primera vez, simplemente lo señalaré, te dejaré encontrarlo y arreglarlo tú cuenta. Además, a lo largo del texto iré haciendo algunas observaciones sobre mejora en tu código y también haré comentarios sobre tus percepciones sobre el tema. Pero si aún no puedes realizar esta tarea, te daré una pista más precisa en la próxima iteración y también algunos ejemplos prácticos. Estaré abierto a comentarios y discusiones sobre el tema.
Encontrará mis comentarios a continuación: no los mueva, modifique ni elimine.
Puedes encontrar mis comentarios en cuadros verdes, amarillos o rojos como este:
Necesita arreglos. Este apartado necesita algunas correcciones. El trabajo no puede ser aceptado con comentarios rojos.
Puedes responder utilizando esto:
Muy bien Juan! Quiero felicitarte, el trabajo que has realizado ha sido sumamente completo. Tienes buenas prácticas en tu codigo, haces buen uso de agrupaciones y funciones de transformación y limpieza de datos. Los gráficos están bien realizados y son muy claros. Realmente te felicito mucho por este gran trabajo, si continuas así no tengo duda alguna de que llegarás a ser una gran gran analista de datos. Te felicito, por favor sigue así!
Te deseo mucho éxito!
Project 7¶
Step 1: Open file and survey¶
Import liberies
import pandas as pd
import numpy as np
from scipy import stats as st
# Read dataset.
df = pd.read_csv('/datasets/games.csv')
Data exploration
df.head()
| Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8 | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | 29.08 | 3.58 | 6.81 | 0.77 | NaN | NaN | NaN |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.3 | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8 | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | 11.27 | 8.89 | 10.22 | 1.00 | NaN | NaN | NaN |
# I found missing data on 'Critic_Score', 'User_Score', 'Rating'; Wrong Dtype in 'Year_of_Release'.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 16715 entries, 0 to 16714 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 16713 non-null object 1 Platform 16715 non-null object 2 Year_of_Release 16446 non-null float64 3 Genre 16713 non-null object 4 NA_sales 16715 non-null float64 5 EU_sales 16715 non-null float64 6 JP_sales 16715 non-null float64 7 Other_sales 16715 non-null float64 8 Critic_Score 8137 non-null float64 9 User_Score 10014 non-null object 10 Rating 9949 non-null object dtypes: float64(6), object(5) memory usage: 1.4+ MB
Step 2: Data cleaning¶
Replace column names¶
# Lowered name columns.
df.columns = df.columns.str.lower()
df.columns
Index(['name', 'platform', 'year_of_release', 'genre', 'na_sales', 'eu_sales',
'jp_sales', 'other_sales', 'critic_score', 'user_score', 'rating'],
dtype='object')
Muy bien Juan! Empezamos muy pero muy bien! utilizaste el método info, el cual nos ayuda a dar un vistazo general del dataframe. Además, veo que te adelantaste y empezaste a transformar tu dataframe cambiando los nombres de las columnas a minúsculas, lo cual es una buena práctica.
Look for duplicates¶
# I did not found duplicates
df['name'] = df['name'].str.lower()
df['platform'] = df['platform'].str.upper()
df['genre'] = df['genre'].str.lower()
df['rating'] = df['rating'].str.upper()
df.duplicated().sum()
0
Handle missing data¶
# Missing Data, Im trying to understad why there are missing data, perhaps only older games have missing data.
# There are missing data in all years.
df[df["critic_score"].isna()]['year_of_release'].sort_values().unique()
array([1980., 1981., 1982., 1983., 1984., 1985., 1986., 1987., 1988.,
1989., 1990., 1991., 1992., 1993., 1994., 1995., 1996., 1997.,
1998., 1999., 2000., 2001., 2002., 2003., 2004., 2005., 2006.,
2007., 2008., 2009., 2010., 2011., 2012., 2013., 2014., 2015.,
2016., nan])
df['critic_score'].sort_values().unique()
array([13., 17., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29.,
30., 31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42.,
43., 44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55.,
56., 57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68.,
69., 70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81.,
82., 83., 84., 85., 86., 87., 88., 89., 90., 91., 92., 93., 94.,
95., 96., 97., 98., nan])
# name
print(df[df['name'].isna()])
name platform year_of_release genre na_sales eu_sales jp_sales \
659 NaN GEN 1993.0 NaN 1.78 0.53 0.00
14244 NaN GEN 1993.0 NaN 0.00 0.00 0.03
other_sales critic_score user_score rating
659 0.08 NaN NaN NaN
14244 0.00 NaN NaN NaN
# I delete this 2 rows because there's no valuable information on them, I cannot know what games they are.
df.dropna(subset=['name'], inplace=True)
print(f"There are {df['name'].isna().sum()} missing values in 'name'")
There are 0 missing values in 'name'
# Missing data per column
print(f"there are {df['platform'].isna().sum()} missing values in 'platform'")
print(f"there are {df['year_of_release'].isna().sum()} missing values in 'year_of_release'")
print(f"there are {df['genre'].isna().sum()} missing values in 'genre'")
print(f"there are {df['na_sales'].isna().sum()} missing values in 'na_sales'")
print(f"there are {df['eu_sales'].isna().sum()} missing values in 'eu_sales'")
print(f"there are {df['jp_sales'].isna().sum()} missing values in 'jp_sales'")
print(f"there are {df['other_sales'].isna().sum()} missing values in 'other_sales'")
print(f"there are {df['critic_score'].isna().sum()} missing values in 'critic_score'")
print(f"there are {df['user_score'].isna().sum()} missing values in 'user_score'")
print(f"there are {df['rating'].isna().sum()} missing values in 'rating'")
there are 0 missing values in 'platform' there are 269 missing values in 'year_of_release' there are 0 missing values in 'genre' there are 0 missing values in 'na_sales' there are 0 missing values in 'eu_sales' there are 0 missing values in 'jp_sales' there are 0 missing values in 'other_sales' there are 8576 missing values in 'critic_score' there are 6699 missing values in 'user_score' there are 6764 missing values in 'rating'
Fill missing values
# year_of_release
df['year_of_release'] = df['year_of_release'].fillna(0)
# critic_score and user_score
df['critic_score'] = df['critic_score'].fillna(0)
df['user_score'] = df['user_score'].fillna(0)
# rating
df['rating'] = df['rating'].fillna('Unknown')
Me gusta la estrategia que sigues para hacer la imputación de valores nulos sobre las columnas year_of_release y rating, sin embargo, sobre clos score, tanto de los criticos como de los usuarios, deberíamos tener cuidado, debido a que 0, puede ser considerado una escala más.
print(f"there are {df['year_of_release'].isna().sum()} missing values in 'year_of_release'")
print(f"there are {df['critic_score'].isna().sum()} missing values in 'critic_score'")
print(f"there are {df['user_score'].isna().sum()} missing values in 'user_score'")
print(f"there are {df['rating'].isna().sum()} missing values in 'rating'")
print()
print("***It is very important to remember, when calculating scores,\nto exclude 0 values, as including them might skew the results.***")
there are 0 missing values in 'year_of_release' there are 0 missing values in 'critic_score' there are 0 missing values in 'user_score' there are 0 missing values in 'rating' ***It is very important to remember, when calculating scores, to exclude 0 values, as including them might skew the results.***
# Check for missing values
df.isna().sum()
name 0 platform 0 year_of_release 0 genre 0 na_sales 0 eu_sales 0 jp_sales 0 other_sales 0 critic_score 0 user_score 0 rating 0 dtype: int64
Change types¶
# Change Dtypes.
# Column "year_of_release" is not needed as float, since we are talking about years, I change it to int.
df['year_of_release'] = df['year_of_release'].astype('int')
# Column "critic_score" has no decimal values, so I change it int.
df['critic_score'] = df['critic_score'].astype('int')
# "tbd" means to be determined, for our reserch is missing data, so I change it to "0".
df['user_score'] = df['user_score'].replace({'tbd': 0}).astype('float')
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 16713 entries, 0 to 16714 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 16713 non-null object 1 platform 16713 non-null object 2 year_of_release 16713 non-null int64 3 genre 16713 non-null object 4 na_sales 16713 non-null float64 5 eu_sales 16713 non-null float64 6 jp_sales 16713 non-null float64 7 other_sales 16713 non-null float64 8 critic_score 16713 non-null int64 9 user_score 16713 non-null float64 10 rating 16713 non-null object dtypes: float64(5), int64(2), object(4) memory usage: 1.5+ MB
Total sales¶
# I create "total_sales" as a new column with the sum of the 4 sales columns.
df["total_sales"] = df["na_sales"] + df["eu_sales"] + df["jp_sales"] + df["other_sales"]
# I rearenge columns order.
position = df.columns.get_loc("other_sales") + 1
df.insert(position, "total_sales", df.pop("total_sales"))
# The Dataframe is clean and ready to be used.
df.sample(20)
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | total_sales | critic_score | user_score | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1733 | wwe smackdown vs. raw 2010 | PS3 | 2009 | fighting | 0.48 | 0.48 | 0.01 | 0.20 | 1.17 | 81 | 8.1 | T |
| 4605 | rush 2: extreme racing usa | N64 | 1998 | racing | 0.35 | 0.06 | 0.00 | 0.00 | 0.41 | 0 | 0.0 | Unknown |
| 5354 | classic nes series: ice climber | GBA | 2004 | platform | 0.10 | 0.04 | 0.20 | 0.01 | 0.35 | 66 | 0.0 | E |
| 4091 | motorstorm: apocalypse | PS3 | 2011 | racing | 0.21 | 0.19 | 0.00 | 0.08 | 0.48 | 77 | 7.7 | T |
| 10803 | jikkyou powerful pro yakyuu 11 | GC | 2004 | sports | 0.00 | 0.00 | 0.09 | 0.00 | 0.09 | 0 | 0.0 | Unknown |
| 10086 | super swing golf season 2 | WII | 2007 | sports | 0.08 | 0.00 | 0.03 | 0.01 | 0.12 | 70 | 0.0 | E10+ |
| 10765 | csi: hard evidence | X360 | 2007 | adventure | 0.07 | 0.02 | 0.00 | 0.01 | 0.10 | 0 | 0.0 | Unknown |
| 9872 | just dance: disney party | WII | 2012 | misc | 0.01 | 0.10 | 0.00 | 0.01 | 0.12 | 0 | 0.0 | E |
| 15165 | hatsukare * renai debut sengen! | PSP | 2013 | adventure | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0 | 0.0 | Unknown |
| 4829 | monster trucks mayhem | WII | 2009 | racing | 0.38 | 0.00 | 0.00 | 0.02 | 0.40 | 0 | 0.0 | E |
| 372 | donkey kong country 3: dixie kong's double tro... | SNES | 1996 | platform | 1.17 | 0.50 | 1.75 | 0.08 | 3.50 | 0 | 0.0 | Unknown |
| 3937 | the adventures of jimmy neutron boy genius: at... | GBA | 2004 | action | 0.36 | 0.13 | 0.00 | 0.01 | 0.50 | 56 | 0.0 | E |
| 9836 | dirt 2 | DS | 2009 | racing | 0.06 | 0.04 | 0.00 | 0.01 | 0.11 | 73 | 0.0 | E |
| 16127 | kono aozora ni yakusoku o: tenohira no rakuen | PSP | 2009 | adventure | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0 | 0.0 | Unknown |
| 15098 | peter pan in disney's return to never land | PS | 2002 | platform | 0.01 | 0.01 | 0.00 | 0.00 | 0.02 | 0 | 0.0 | Unknown |
| 3120 | 007: quantum of solace | WII | 2008 | action | 0.29 | 0.28 | 0.01 | 0.07 | 0.65 | 54 | 7.5 | T |
| 4462 | singstar queen | PS3 | 2009 | misc | 0.11 | 0.25 | 0.00 | 0.07 | 0.43 | 0 | 0.0 | Unknown |
| 1520 | mega man battle network 3 blue / white version | GBA | 2002 | role-playing | 0.39 | 0.02 | 0.85 | 0.03 | 1.29 | 0 | 0.0 | Unknown |
| 8343 | transformers: dark of the moon | WII | 2011 | action | 0.09 | 0.06 | 0.00 | 0.02 | 0.17 | 0 | 0.0 | Unknown |
| 14979 | blazblue: chrono phantasma extend | PSV | 2015 | action | 0.00 | 0.02 | 0.00 | 0.01 | 0.03 | 0 | 6.8 | T |
Step 3: Data analysis¶
Games released by year¶
# How many games were released per year.
df.groupby('year_of_release')['name'].count().sort_values(ascending=False).reset_index().head()
| year_of_release | name | |
|---|---|---|
| 0 | 2008 | 1427 |
| 1 | 2009 | 1426 |
| 2 | 2010 | 1255 |
| 3 | 2007 | 1197 |
| 4 | 2011 | 1136 |
import plotly.express as px
# years "0" are missing values, thats why I need to filter the df before creating an histogram.
df_filtered = df[df["year_of_release"] != 0]
hist_releases = px.histogram(df_filtered, x="year_of_release",
title="Most of the releases were between 2007 and 2011",
labels={"year_of_release": "Year"}
#text_auto=True
)
# Mostrar la gráfica
hist_releases.show()
Sales per platform¶
# Platforms with more total sales.
best_sellers = df.groupby('platform')['total_sales'].sum().sort_values(ascending=False).reset_index().head(6)
best_sellers
| platform | total_sales | |
|---|---|---|
| 0 | PS2 | 1255.77 |
| 1 | X360 | 971.42 |
| 2 | PS3 | 939.65 |
| 3 | WII | 907.51 |
| 4 | DS | 806.12 |
| 5 | PS | 730.86 |
Lifetime per platform¶
# Best platforms selling titles.
# df_filtered excludes years with value "0"
""" Play Station One """
ps_one = df_filtered[df_filtered['platform'] == 'PS']
hist_sales = px.histogram(
ps_one, # Data
x="year_of_release", # Eje X: los años
y="total_sales", # Eje Y: ingresos
title="Play Station One lifetime",
color_discrete_sequence=["grey"],
labels={"year_of_release": "Year",
"total_sales": "Sales"}
#text_auto=True
)
# Mostrar la gráfica
hist_sales.show()
Vale, el gráfico se ve bien, pero estoy notando que a lo largo de los gráficos que realizas, no estás colocando tus observaciones y conclusiones, es importante que realices esto, debido a que es un habito que en el futuro te ayudará, recuerda que traducir lo que nos datos nos quieren decir, es una de las principales labores de un analista.
""" Nintendo DS """
nintendo_ds = df_filtered[df_filtered['platform'] == 'DS']
hist_sales = px.histogram(
nintendo_ds, # Data
x="year_of_release", # Eje X: los años
y="total_sales", # Eje Y: ingresos
title="Nintendo DS lifetime",
color_discrete_sequence=["blue"],
labels={"year_of_release": "Year",
"total_sales": "Sales"},
range_x=[2003, 2015]
#text_auto=True
)
# Mostrar la gráfica
hist_sales.show()
""" Nintendo Wii """
nintendo_wii = df_filtered[df_filtered['platform'] == 'WII']
hist_sales = px.histogram(
nintendo_wii, # Data
x="year_of_release", # Eje X: los años
y="total_sales", # Eje Y: ingresos
title="Nintendo Wii lifetime",
color_discrete_sequence=["white"],
labels={"year_of_release": "Year",
"total_sales": "Sales"},
range_x=[2005, 2017]
#text_auto=True
)
# Mostrar la gráfica
hist_sales.show()
""" Play Station 3 """
ps_3 = df_filtered[df_filtered['platform'] == 'PS3']
hist_sales = px.histogram(
ps_3, # Data
x="year_of_release", # Eje X: los años
y="total_sales", # Eje Y: ingresos
title="Play Station 3 lifetime",
color_discrete_sequence=["black"],
labels={"year_of_release": "Year",
"total_sales": "Sales"}
#text_auto=True
)
# Mostrar la gráfica
hist_sales.show()
""" Xbox 360 """
xbox_360 = df_filtered[df_filtered['platform'] == 'X360']
hist_sales = px.histogram(
xbox_360, # Data
x="year_of_release", # Eje X: los años
y="total_sales", # Eje Y: ingresos
title="Xbox 360 lifetime",
color_discrete_sequence=["rgb(242,242,242)"],
labels={"year_of_release": "Year",
"total_sales": "Sales"}
#text_auto=True
)
# Mostrar la gráfica
hist_sales.show()
""" Play Station 2 """
ps_2 = df_filtered[df_filtered['platform'] == 'PS2']
hist_sales = px.histogram(
ps_2, # Data
x="year_of_release", # Eje X: los años
y="total_sales", # Eje Y: ingresos
title="Play Station 2 lifetime",
color_discrete_sequence=["rgb(102,102,102)"],
labels={"year_of_release": "Year",
"total_sales": "Sales"}
#text_auto=True
)
# Mostrar la gráfica
hist_sales.show()
En todos los casos, los gráficos están correctos, sin embargo es importante que los interpretes, esto hará tu notebook mucho más completo. Otra sugerencia que tengo es que debido a que estás realizando una comparación de diferentes plataformas, valdría mucho la pena que colocarás estos en un solo plot, para poder observar mejor las diferencias.
Comparison¶
# I create a dataframe with only the Best selling platforms.
best_platforms_names = best_sellers['platform']
df_best_platforms = df_filtered[df_filtered['platform'].isin(best_platforms_names)]
df_best_sellers = df_best_platforms[['platform', 'name', 'year_of_release', 'total_sales']]
df_best_sellers
| platform | name | year_of_release | total_sales | |
|---|---|---|---|---|
| 0 | WII | wii sports | 2006 | 82.54 |
| 2 | WII | mario kart wii | 2008 | 35.52 |
| 3 | WII | wii sports resort | 2009 | 32.77 |
| 6 | DS | new super mario bros. | 2006 | 29.80 |
| 7 | WII | wii play | 2006 | 28.91 |
| ... | ... | ... | ... | ... |
| 16700 | DS | mezase!! tsuri master ds | 2009 | 0.01 |
| 16704 | DS | plushees | 2008 | 0.01 |
| 16709 | PS2 | score international baja 1000: the official game | 2008 | 0.00 |
| 16710 | PS3 | samurai warriors: sanada maru | 2016 | 0.01 |
| 16711 | X360 | lma manager 2007 | 2006 | 0.01 |
9262 rows × 4 columns
# I group the total sales by platform and year.
grouped_df = df_best_sellers.groupby(["platform", "year_of_release"], as_index=False)['total_sales'].sum()
grouped_df = grouped_df.drop(index=0).reset_index(drop=True)
grouped_df
| platform | year_of_release | total_sales | |
|---|---|---|---|
| 0 | DS | 2004 | 17.27 |
| 1 | DS | 2005 | 130.14 |
| 2 | DS | 2006 | 119.81 |
| 3 | DS | 2007 | 146.94 |
| 4 | DS | 2008 | 145.31 |
| ... | ... | ... | ... |
| 61 | X360 | 2012 | 99.74 |
| 62 | X360 | 2013 | 88.58 |
| 63 | X360 | 2014 | 34.74 |
| 64 | X360 | 2015 | 11.96 |
| 65 | X360 | 2016 | 1.52 |
66 rows × 3 columns
# Crear la gráfica de líneas
fig = px.line(
grouped_df,
x="year_of_release", # Eje X: Año de lanzamiento
y="total_sales", # Eje Y: Ventas totales
color="platform", # Líneas separadas por consola
markers=True,
title="The most successful platforms tend to remain in the market for around 10 years",
labels={"year_of_release": "Years", "total_sales": "Total Sales (in millions)", "platform": "Platform"}
)
# Mostrar la gráfica
fig.show()
Used to be popular¶
PS2, X360, PS3, Wii, DS, and PS were once the most successful platforms and, historically, the most lucrative. However, none of them are active these days (as of 2016), They had a life span of approximately 10 years, they are no longer generating sales. On the other hand, there are some new platforms that are just emerging and starting to grow. I'll explore them next.
Looking for the next bestsellers¶
# Best selling platforms since 2014
df_filtered[df_filtered['year_of_release'] >= 2014].groupby('platform')['total_sales'].sum().sort_values(ascending=False).head(5)
platform PS4 288.15 XONE 140.36 3DS 86.68 PS3 68.18 X360 48.22 Name: total_sales, dtype: float64
df = df_filtered
# New best selling platforms
new_sellers = df[df['platform'].isin(['PS4', 'XONE', '3DS'])]
new_sellers_by_years = new_sellers.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()
# line graph
fig = px.line(
new_sellers_by_years,
x="year_of_release", # Eje X: Año de lanzamiento
y="total_sales", # Eje Y: Ventas totales
color="platform", # Líneas separadas por consola
markers=True,
title="PS4 is the most successful platform currently",
labels={"year_of_release": "Years", "total_sales": "Total Sales (in millions)", "platform": "Platform"}
)
# Mostrar la gráfica
fig.show()
Box graphs¶
fig = px.box(
df,
x="platform",
y="total_sales",
color="platform",
title="Total sales distribution",
labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
height=800, # Aumentar altura
width=1200 # Aumentar ancho
)
fig.show()
#outliers
df.query("platform == 'WII' and total_sales > 1.9")
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | total_sales | critic_score | user_score | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wii sports | WII | 2006 | sports | 41.36 | 28.96 | 3.77 | 8.45 | 82.54 | 76 | 8.0 | E |
| 2 | mario kart wii | WII | 2008 | racing | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82 | 8.3 | E |
| 3 | wii sports resort | WII | 2009 | sports | 15.61 | 10.93 | 3.28 | 2.95 | 32.77 | 80 | 8.0 | E |
| 7 | wii play | WII | 2006 | misc | 13.96 | 9.18 | 2.93 | 2.84 | 28.91 | 58 | 6.6 | E |
| 8 | new super mario bros. wii | WII | 2009 | platform | 14.44 | 6.94 | 4.70 | 2.24 | 28.32 | 87 | 8.4 | E |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 818 | kirby's epic yarn | WII | 2010 | platform | 1.42 | 0.09 | 0.45 | 0.10 | 2.06 | 0 | 0.0 | Unknown |
| 819 | wipeout: the game | WII | 2009 | misc | 1.94 | 0.00 | 0.00 | 0.12 | 2.06 | 0 | 0.0 | Unknown |
| 843 | rayman raving rabbids: tv party | WII | 2008 | misc | 0.72 | 1.08 | 0.00 | 0.23 | 2.03 | 73 | 7.7 | E10+ |
| 850 | just dance 2015 | WII | 2014 | misc | 0.99 | 0.85 | 0.00 | 0.17 | 2.01 | 0 | 8.4 | E10+ |
| 887 | call of duty: world at war | WII | 2008 | shooter | 1.17 | 0.58 | 0.00 | 0.18 | 1.93 | 83 | 7.6 | M |
64 rows × 12 columns
wii = df.query("platform == 'WII'")
fig = px.box(
wii,
x="total_sales",
y="platform",
title="Wii Total sales distribution",
labels={"platform": " ", "total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
height=800, # Aumentar altura
width=1100 # Aumentar ancho
)
fig.show()
nes = df.query("platform == 'NES' and total_sales > 4.38")
fig = px.box(
nes,
x="total_sales",
y="platform",
title="NES Total sales distribution",
labels={"platform": " ", "total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
height=800, # Aumentar altura
width=1100 # Aumentar ancho
)
fig.show()
gb = df.query("platform == 'GB'")
fig = px.box(
gb,
x="total_sales",
y="platform",
title="GB Total sales distribution",
labels={"platform": " ", "total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
height=800, # Aumentar altura
width=1100 # Aumentar ancho
)
fig.show()
ds = df.query("platform == 'DS'")
fig = px.box(
ds,
x="total_sales",
y="platform",
title="DS Total sales distribution",
labels={"platform": " ", "total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
height=800, # Aumentar altura
width=1100 # Aumentar ancho
)
fig.show()
Most platforms have an mean close to zero.
Corelation¶
# Box graph to understand outlier values on sales
sales = df.query("total_sales != 0")
fig = px.box(
sales,
x="total_sales")
fig.update_layout(
xaxis_range=[0, 2])
fig.show()
# Critic Score VS Total Sales
critic_score = df.query("critic_score != 0 and total_sales < 1.1")
correlation_critic_score = critic_score["critic_score"].corr(critic_score["total_sales"])
print(f"Correlation: {correlation_critic_score}")
# Fig
fig = px.scatter(
critic_score,
x="total_sales",
y="critic_score",
title="There is a slight positive correlation between Critic Scores and Sales.",
labels={"critic_score": "Critic Score", "total_sales": "Total Sales (in millions)"},
trendline="ols"
)
for trace in fig.data:
if trace.mode == "lines": # Identificar la línea de tendencia
trace.line.color = "red" # Cambiar el color
fig.update_layout(
yaxis_range=[0, 100], xaxis_range=[0, 1.1])
# Mostrar la gráfica
fig.show()
Correlation: 0.267071717767249
# User Score VS Total Sales
user_score = df.query("user_score != 0 and total_sales < 1.1")
correlation_user_score = user_score["user_score"].corr(user_score["total_sales"])
print(f"Correlation: {correlation_user_score}")
fig = px.scatter(
user_score,
x="total_sales", # Eje X
y="user_score", # Eje Y
title="There is a slight positive correlation between User Score and Sales",
labels={"critic_score": "Critic Score", "total_sales": "Total Sales (in millions)"},
trendline="ols" # Opcional: agrega una línea de tendencia
)
for trace in fig.data:
if trace.mode == "lines": # Identificar la línea de tendencia
trace.line.color = "red" # Cambiar el color
fig.update_layout(
yaxis_range=[0, 10],
xaxis_range=[0, 1.1])
# Mostrar la gráfica
fig.show()
Correlation: 0.10769542636565735
There is a correlation between rating and sales, but it is very low in both cases.
Same games on diferent platforms¶
df.groupby('name')['platform'].count().sort_values(ascending=False)
name
need for speed: most wanted 12
ratatouille 9
lego marvel super heroes 9
fifa 14 9
madden nfl 07 8
..
infamous: first light 1
infamous: second son 1
infected 1
infernal: hell's vengeance 1
¡shin chan flipa en colores! 1
Name: platform, Length: 11426, dtype: int64
# Most lucrative games on diferent platforms
result = df.groupby("name").agg(
platform=("platform", lambda x: " - ".join(sorted(x.unique()))), # Listar plataformas únicas
total_sales=("total_sales", "sum"), # Sumar las ventas totales
).reset_index()
result.sort_values(by="total_sales", ascending=False).head(20)
| name | platform | total_sales | |
|---|---|---|---|
| 10927 | wii sports | WII | 82.54 |
| 3691 | grand theft auto v | PC - PS3 - PS4 - X360 - XONE | 56.58 |
| 9300 | super mario bros. | GB - NES | 45.31 |
| 9660 | tetris | GB - NES | 35.84 |
| 5501 | mario kart wii | WII | 35.52 |
| 10929 | wii sports resort | WII | 32.77 |
| 7316 | pokemon red/pokemon blue | GB | 31.38 |
| 1220 | call of duty: modern warfare 3 | PC - PS3 - WII - X360 | 30.60 |
| 6614 | new super mario bros. | DS | 29.80 |
| 1215 | call of duty: black ops ii | PC - PS3 - WIIU - X360 | 29.40 |
| 1213 | call of duty: black ops | DS - PS3 - WII - X360 | 29.20 |
| 10925 | wii play | WII | 28.91 |
| 6617 | new super mario bros. wii | WII | 28.32 |
| 2580 | duck hunt | NES | 28.31 |
| 1217 | call of duty: ghosts | PC - PS3 - PS4 - WIIU - X360 - XONE | 27.39 |
| 9314 | super mario world | GBA - SNES | 26.09 |
| 1214 | call of duty: black ops 3 | PC - PS3 - PS4 - X360 - XONE | 25.67 |
| 1219 | call of duty: modern warfare 2 | PC - PS3 - X360 | 24.97 |
| 6770 | nintendogs | DS | 24.67 |
| 5862 | minecraft | PS3 - PS4 - PSV - WIIU - X360 - XONE | 24.16 |
# Grand Theft Auto V
grand_theft_auto_v = df.loc[df['name'] == "grand theft auto v"]
grand_theft_auto_v['total_sales'].round(2)
grand_theft_auto_v
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | total_sales | critic_score | user_score | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | grand theft auto v | PS3 | 2013 | action | 7.02 | 9.09 | 0.98 | 3.96 | 21.05 | 97 | 8.2 | M |
| 23 | grand theft auto v | X360 | 2013 | action | 9.66 | 5.14 | 0.06 | 1.41 | 16.27 | 97 | 8.1 | M |
| 42 | grand theft auto v | PS4 | 2014 | action | 3.96 | 6.31 | 0.38 | 1.97 | 12.62 | 97 | 8.3 | M |
| 165 | grand theft auto v | XONE | 2014 | action | 2.81 | 2.19 | 0.00 | 0.47 | 5.47 | 97 | 7.9 | M |
| 1730 | grand theft auto v | PC | 2015 | action | 0.39 | 0.69 | 0.00 | 0.09 | 1.17 | 96 | 7.9 | M |
fig = px.bar(
grand_theft_auto_v,
x="platform",
y="total_sales",
title="grand_theft_auto_v",
labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"},
color="platform",
text="total_sales" # Mostrar valores en las barras
)
fig.show()
# Super Mario
# Not sure to count it as the same game, since the second one was released 14 years after the first one.
super_mario = df[df['name'] == "super mario bros."]
super_mario.loc[:, 'total_sales'].round(2)
super_mario
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | total_sales | critic_score | user_score | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | super mario bros. | NES | 1985 | platform | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | 0 | 0.0 | Unknown |
| 201 | super mario bros. | GB | 1999 | platform | 3.40 | 1.30 | 0.15 | 0.22 | 5.07 | 0 | 0.0 | Unknown |
fig = px.bar(
super_mario,
x="platform",
y="total_sales",
title="super_mario",
color="platform",
labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"}
)
fig.show()
# Tetris
tetris = df[df['name'] == "tetris"]
tetris
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | total_sales | critic_score | user_score | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | tetris | GB | 1989 | puzzle | 23.20 | 2.26 | 4.22 | 0.58 | 30.26 | 0 | 0.0 | Unknown |
| 157 | tetris | NES | 1988 | puzzle | 2.97 | 0.69 | 1.81 | 0.11 | 5.58 | 0 | 0.0 | Unknown |
fig = px.bar(
tetris,
x="platform",
y="total_sales",
title="Tetris",
color="platform",
labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"}
)
fig.show()
# call of duty: modern warfare 3
cod = df[df['name'] == "call of duty: modern warfare 3"]
cod
| name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | total_sales | critic_score | user_score | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | call of duty: modern warfare 3 | X360 | 2011 | shooter | 9.04 | 4.24 | 0.13 | 1.32 | 14.73 | 88 | 3.4 | M |
| 37 | call of duty: modern warfare 3 | PS3 | 2011 | shooter | 5.54 | 5.73 | 0.49 | 1.57 | 13.33 | 88 | 3.2 | M |
| 1040 | call of duty: modern warfare 3 | PC | 2011 | shooter | 0.41 | 0.98 | 0.00 | 0.32 | 1.71 | 78 | 2.5 | M |
| 2500 | call of duty: modern warfare 3 | WII | 2011 | shooter | 0.55 | 0.20 | 0.00 | 0.08 | 0.83 | 70 | 1.8 | M |
fig = px.bar(
cod,
x="platform",
y="total_sales",
title="call of duty: modern warfare 3",
color="platform",
text="total_sales",
labels={"platform": "Platform", "total_sales": "Total Sales (in millions)"}
)
fig.show()
It seems that titles sell better on consoles than on PC. Additionally, games on Game Boy might perform well, but they don't even come close to the numbers achieved on the NES.
Genre vs Sales¶
# I group and arrange the values as needed. (I formatted the numbers to show only two decimal places.)
sales_sorted = sales.groupby("genre", as_index=False)["total_sales"].sum().sort_values(by="total_sales", ascending=False)
sales_sorted["total_sales"] = sales_sorted["total_sales"].round(2)
sales_sorted
| genre | total_sales | |
|---|---|---|
| 0 | action | 1716.52 |
| 10 | sports | 1309.67 |
| 8 | shooter | 1041.36 |
| 7 | role-playing | 931.08 |
| 4 | platform | 825.55 |
| 3 | misc | 790.29 |
| 6 | racing | 723.49 |
| 2 | fighting | 442.66 |
| 9 | simulation | 387.96 |
| 5 | puzzle | 239.89 |
| 1 | adventure | 233.33 |
| 11 | strategy | 172.57 |
# Distribution of genre by sales
fig = px.bar(
sales_sorted,
x="genre",
y="total_sales",
title="Genre distribution by sales",
labels={"genre": "Genre", "total_sales": "Sales (in millions)"},
color="genre",
text="total_sales" # Mostrar valores en las barras
)
fig.show()
About genres in videogames
Apparently, it seems more lucrative to sell action games than strategy games. However, I should analyze the data more closely to clarify this hypothesis. Perhaps the "action" genre generates the highest amount of revenue because it is the most popular genre and includes a large number of titles, not necessarily because all of those games are major hits. On the other hand, "strategy" games might have a few great-selling titles but lack the sheer number of releases.
# DF games by name
games_grouped = df.groupby("name", as_index=False).agg(
platform=("platform", lambda x: ", ".join(sorted(x.unique()))),
total_sales=("total_sales", "sum"), genre=("genre", lambda x: ", ".join(sorted(x.unique())))
).sort_values(by="total_sales", ascending=False)
games_grouped
| name | platform | total_sales | genre | |
|---|---|---|---|---|
| 10927 | wii sports | WII | 82.54 | sports |
| 3691 | grand theft auto v | PC, PS3, PS4, X360, XONE | 56.58 | action |
| 9300 | super mario bros. | GB, NES | 45.31 | platform |
| 9660 | tetris | GB, NES | 35.84 | puzzle |
| 5501 | mario kart wii | WII | 35.52 | racing |
| ... | ... | ... | ... | ... |
| 2860 | exhibition volume 6 | XB | 0.01 | misc |
| 2843 | evil genius | PC | 0.01 | strategy |
| 6981 | ozzy & drix | GBA | 0.01 | platform |
| 4784 | katekyoo hitman reborn nerae!? ring x bongole ... | PS2 | 0.01 | adventure |
| 3399 | g1 jockey 4 2008 | PS3 | 0.00 | sports |
11426 rows × 4 columns
# Count of games by Genre
games_grouped['genre'].value_counts()
action 1929 sports 1366 misc 1317 role-playing 1213 adventure 1052 shooter 812 racing 761 simulation 717 fighting 606 platform 579 strategy 578 puzzle 491 action, adventure 2 shooter, strategy 1 misc, strategy 1 action, shooter 1 Name: genre, dtype: int64
"Action" is the most popular genre among all the games
Action includes 1,929 titles, followed by Sports with 1,366. On the other hand, Strategy has only 578 titles.
#Action games
action_games = games_grouped[games_grouped['genre'] == 'action']
action_games.sort_values(by='total_sales', ascending=False).head(20)
| name | platform | total_sales | genre | |
|---|---|---|---|---|
| 3691 | grand theft auto v | PC, PS3, PS4, X360, XONE | 56.58 | action |
| 3695 | grand theft auto: san andreas | PC, PS2, X360, XB | 23.85 | action |
| 3690 | grand theft auto iv | PC, PS3, X360 | 22.37 | action |
| 3696 | grand theft auto: vice city | PC, PS2 | 16.19 | action |
| 3090 | fifa soccer 13 | 3DS, PC, PS3, PSP, PSV, WII, WIIU, X360 | 15.97 | action |
| 5157 | lego star wars: the complete saga | DS, PS3, WII, X360 | 15.25 | action |
| 3689 | grand theft auto iii | PC, PS2 | 13.11 | action |
| 487 | assassin's creed iv: black flag | PC, PS3, PS4, WIIU, X360, XONE | 13.06 | action |
| 485 | assassin's creed iii | PC, PS3, WIIU, X360 | 13.01 | action |
| 7795 | red dead redemption | PS3, X360 | 12.80 | action |
| 9907 | the legend of zelda: ocarina of time | 3DS, N64 | 11.93 | action |
| 7304 | pokemon heartgold/pokemon soulsilver | DS | 11.77 | action |
| 483 | assassin's creed ii | PC, PS3, X360 | 11.34 | action |
| 3693 | grand theft auto: liberty city stories | PS2, PSP | 11.21 | action |
| 671 | batman: arkham city | PC, PS3, WIIU, X360 | 11.10 | action |
| 9876 | the last of us | PS3, PS4 | 10.58 | action |
| 9085 | star wars: the force unleashed | DS, PS2, PS3, PSP, WII, X360 | 10.02 | action |
| 5140 | lego indiana jones: the original adventures | DS, PC, PS2, PS3, PSP, X360 | 9.57 | action |
| 5146 | lego marvel super heroes | 3DS, DS, PC, PS3, PS4, PSV, WIIU, X360, XONE | 9.41 | action |
| 10853 | watch dogs | PC, PS3, PS4, WIIU, X360, XONE | 9.18 | action |
#I create a box plot to understand the mean and outliers.
# Following that, I generate a histogram to visualize the data distribution.
fig = px.box(
action_games,
x="total_sales",
title="Action games layouts",
color_discrete_sequence=["#1F77B4"],
labels={"total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
xaxis_range=[0, 1])
fig.show()
fig = px.histogram(
action_games,
x="total_sales",
title="Action games distribution",
color_discrete_sequence=["#1F77B4"],
labels={"total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
xaxis_range=[0, 7])
fig.show()
# Strategy games
strategy = games_grouped[games_grouped['genre'] == 'strategy']
strategy.sort_values(by='total_sales', ascending=False).head(20)
| name | platform | total_sales | genre | |
|---|---|---|---|---|
| 7321 | pokemon stadium | N64 | 5.45 | strategy |
| 10850 | warzone 2100 | PS | 5.01 | strategy |
| 9094 | starcraft ii: wings of liberty | PC | 4.83 | strategy |
| 10804 | warcraft ii: tides of darkness | PC | 4.20 | strategy |
| 7339 | pokémon trading card game | GB | 3.70 | strategy |
| 1557 | command & conquer: red alert | PC, PS | 3.27 | strategy |
| 7338 | pokémon stadium 2 | N64 | 2.74 | strategy |
| 8617 | sim theme park | PC, PS | 2.71 | strategy |
| 3882 | halo wars | X360 | 2.62 | strategy |
| 10102 | theme hospital | PC, PS | 2.49 | strategy |
| 10807 | warcraft: orcs & humans | PC | 2.08 | strategy |
| 11324 | yu-gi-oh! the eternal duelist soul | GBA | 2.07 | strategy |
| 8585 | sid meier's civilization revolution | DS, PS3, X360 | 2.05 | strategy |
| 11201 | xcom: enemy unknown | PC, PS3, X360 | 1.90 | strategy |
| 10253 | tom clancy's endwar | DS, PC, PS3, PSP, X360 | 1.82 | strategy |
| 331 | angry birds star wars | 3DS, PS3, PS4, PSV, WII, WIIU, X360, XONE | 1.73 | strategy |
| 9092 | starcraft ii: heart of the swarm | PC | 1.67 | strategy |
| 8587 | sid meier's civilization v | PC | 1.67 | strategy |
| 7215 | pikmin | GC | 1.63 | strategy |
| 11304 | yu-gi-oh! duel monsters | GB | 1.62 | strategy |
#Strategy outliers
fig = px.box(
strategy,
x="total_sales",
title="Strategy games layouts",
color_discrete_sequence=["#fb8500"],
labels={"total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
xaxis_range=[0, 1])
fig.show()
# Strategy distribution
fig = px.histogram(
strategy,
x="total_sales",
title="Strategy games distribution",
color_discrete_sequence=["#fb8500"],
labels={"total_sales": "Total Sales (in millions)"}
)
fig.update_layout(
xaxis_range=[0, 6])
fig.show()
Action games generally generate better revenue than strategy games. However, that doesn’t exclude strategy games from having great titles that can achieve good sales. Nevertheless, action games are generally better received by the audience than strategy games.
Step 4: User profile¶
# I group the df per platform.
# recent years.
df_recent_years = df[df['year_of_release'] >= 2013]
platforms_grouped = df_recent_years.groupby('platform')[['na_sales', 'eu_sales', 'jp_sales', 'total_sales']].sum().reset_index()
platforms_grouped.sort_values(by='total_sales', ascending=False).head()
| platform | na_sales | eu_sales | jp_sales | total_sales | |
|---|---|---|---|---|---|
| 4 | PS4 | 108.74 | 141.09 | 15.96 | 314.14 |
| 3 | PS3 | 63.50 | 67.81 | 23.35 | 181.43 |
| 10 | XONE | 93.12 | 51.59 | 0.34 | 159.32 |
| 0 | 3DS | 38.20 | 30.96 | 67.81 | 143.25 |
| 9 | X360 | 81.66 | 42.52 | 0.51 | 136.80 |
5 Principal platforms per region (NA, UE, JP)¶
""" North America """
america_platforms = platforms_grouped.sort_values(by='na_sales', ascending=False)[['platform', 'na_sales']]
america_platforms['na_sales'] = america_platforms['na_sales'].round(2)
fig = px.bar(
america_platforms.head(5),
x="platform",
y="na_sales",
title="Principal platforms in North America",
color="platform",
labels={"platform": "Platform", "na_sales": "Sales (in millions)"},
text="na_sales" # Mostrar valores en las barras
)
fig.show()
best_na_platforms = []
for console in america_platforms['platform'].head():
best_na_platforms.append(console)
print(f"The best selling platforms in North America are:\n{best_na_platforms}")
The best selling platforms in North America are: ['PS4', 'XONE', 'X360', 'PS3', '3DS']
""" Europe """
europe_platforms = platforms_grouped.sort_values(by='eu_sales', ascending=False)[['platform', 'eu_sales']]
europe_platforms['eu_sales'] = europe_platforms['eu_sales'].round(2)
fig = px.bar(
europe_platforms.head(5),
x="platform",
y="eu_sales",
title="Principal platforms in Europe",
color="platform",
text="eu_sales",
labels={"platform": "Platform", "eu_sales": "Sales (in millions)"},
)
fig.show()
best_eu_platforms = []
for console in europe_platforms['platform'].head():
best_eu_platforms.append(console)
print(f"The best selling platforms in Europe are:\n{best_na_platforms}")
The best selling platforms in Europe are: ['PS4', 'XONE', 'X360', 'PS3', '3DS']
""" Japan """
japan_platforms = platforms_grouped.sort_values(by='jp_sales', ascending=False)[['platform', 'jp_sales']]
japan_platforms['jp_sales'] = japan_platforms['jp_sales'].round(2)
fig = px.bar(
japan_platforms.head(5),
x="platform",
y="jp_sales",
title="Principal platforms in Japan",
color="platform",
text="jp_sales",
labels={"platform": "Platform", "jp_sales": "Sales (in millions)"},
)
fig.show()
best_jp_platforms = []
for console in japan_platforms['platform'].head():
best_jp_platforms.append(console)
print(f"The best selling platforms in Japan are:\n{best_jp_platforms}")
The best selling platforms in Japan are: ['3DS', 'PS3', 'PSV', 'PS4', 'WIIU']
5 Principal videogame genres per region¶
#DF games grouped by genres
games_by_genre = (
df_recent_years.groupby('genre', as_index=False)
[['na_sales', 'eu_sales', 'jp_sales', 'total_sales']].sum().reset_index(drop=True))
"""I’ll avoid the miscellaneous genre since those games are so different from each other
that I don’t think they should be placed in a single category. Because of this,
I’ll treat them as missing data for now."""
games_by_genre.drop(index=3, inplace=True)
games_by_genre.reset_index(drop=True, inplace=True)
games_by_genre
| genre | na_sales | eu_sales | jp_sales | total_sales | |
|---|---|---|---|---|---|
| 0 | action | 126.05 | 118.13 | 40.49 | 321.87 |
| 1 | adventure | 7.14 | 8.25 | 5.82 | 23.64 |
| 2 | fighting | 15.55 | 8.55 | 7.65 | 35.31 |
| 3 | platform | 18.14 | 15.58 | 4.79 | 42.63 |
| 4 | puzzle | 0.83 | 1.00 | 1.18 | 3.17 |
| 5 | racing | 12.96 | 20.19 | 2.30 | 39.89 |
| 6 | role-playing | 46.40 | 36.97 | 51.04 | 145.89 |
| 7 | shooter | 109.74 | 87.86 | 6.61 | 232.98 |
| 8 | simulation | 4.86 | 10.92 | 4.52 | 21.76 |
| 9 | sports | 65.27 | 60.52 | 5.41 | 150.65 |
| 10 | strategy | 3.28 | 4.22 | 1.77 | 10.08 |
""" North America """
america_genres = games_by_genre.sort_values(by='na_sales', ascending=False)[['genre', 'na_sales']]
america_genres['na_sales'] = america_genres['na_sales'].round(2)
popular_na_genres = []
for genre in america_genres['genre'].head():
popular_na_genres.append(genre)
fig = px.bar(
america_genres.head(5),
x="genre",
y="na_sales",
labels={"genre": "Genre", "na_sales": "Sales (in millions)"},
title=f"Principal genres in North America are: {', '.join(popular_na_genres)}.",
color="genre",
text="na_sales",
color_discrete_map={
"shooter": "red",
"sports": "green",
"action": "blue",
"role-playing": "goldenrod",
"misc": "cyan"})
fig.show()
""" Europe """
europe_genres = games_by_genre.sort_values(by='eu_sales', ascending=False)[['genre', 'eu_sales']]
europe_genres['eu_sales'] = europe_genres['eu_sales'].round(2)
popular_eu_genres = []
for genre in europe_genres['genre'].head():
popular_eu_genres.append(genre)
fig = px.bar(
europe_genres.head(5),
x="genre",
y="eu_sales",
labels={"genre": "Genre", "eu_sales": "Sales (in millions)"},
title=f"Principal genres in Europe are: {', '.join(popular_eu_genres)}.",
color="genre",
text="eu_sales",
color_discrete_map={
"shooter": "red",
"sports": "green",
"action": "blue",
"role-playing": "goldenrod",
"misc": "magenta",
"racing": "purple"})
fig.show()
""" Japan """
japan_genres = games_by_genre.sort_values(by='jp_sales', ascending=False)[['genre', 'jp_sales']]
japan_genres['jp_sales'] = japan_genres['jp_sales'].round(2)
popular_jp_genres = []
for genre in japan_genres['genre'].head():
popular_jp_genres.append(genre)
fig = px.bar(
japan_genres.head(5),
x="genre",
y="jp_sales",
labels={"genre": "Genre", "jp_sales": "Sales (in millions)"},
title=f"Principal genres in Japan are: {', '.join(popular_eu_genres)}.",
color="genre",
text="jp_sales",
color_discrete_map={
"shooter": "red",
"sports": "green",
"action": "blue",
"role-playing": "goldenrod",
"adventure": "orange",
"racing": "purple",
"fighting": "#66AA00"})
fig.show()
Principal Genres
In North America and Europe, action games and shooter games are the most dominant genres. In contrast, in Japan, role-playing games are the most popular.
Additionally, in North America and Europe, the sports genre is quite important. However, this is not the case in Japan, where they seem to be more interested in fighting games.
Rating per region¶
# I group games by rating
games_by_rating = (
df_recent_years.groupby('rating', as_index=False)
[['na_sales', 'eu_sales', 'jp_sales', 'total_sales']].sum().reset_index(drop=True))
# I remove the unknown ratings
games_by_rating.drop(index=4, inplace=True)
columns_to_round = ['na_sales', 'eu_sales', 'jp_sales']
games_by_rating[columns_to_round] = games_by_rating[columns_to_round].round(2)
""" North America """
fig = px.bar(
games_by_rating.sort_values(by='na_sales', ascending=False),
x="rating",
y="na_sales",
labels={"rating": "Rating", "na_sales": "Sales (in millions)"},
title="'M' rating leads the sales in North America",
text="na_sales",
color="rating",
color_discrete_map={
"E": "purple",
"E10+": "yellow",
"M": "red",
"T": "blue"})
fig.show()
""" Europe """
fig = px.bar(
games_by_rating.sort_values(by='eu_sales', ascending=False),
x="rating",
y="eu_sales",
labels={"rating": "Rating", "eu_sales": "Sales (in millions)"},
title="'E' rating leads the sales in Europe",
text="eu_sales",
color="rating",
color_discrete_map={
"E": "purple",
"E10+": "yellow",
"M": "red",
"T": "blue"})
fig.show()
""" Japan """
fig = px.bar(
games_by_rating.sort_values(by='jp_sales', ascending=False),
x="rating",
y="jp_sales",
labels={"rating": "Rating", "jp_sales": "Sales (in millions)"},
title="'T' Rate leads the sales in Japan",
text="jp_sales",
color="rating",
color_discrete_map={
"E": "purple",
"E10+": "yellow",
"M": "red",
"T": "blue"})
fig.show()
Profiles conclusion¶
In North America, the audience prefer PlayStation 4 and Xbox One consoles, along with video game genres like action, shooters, and sports. They also tend to favor games with a Mature (M) rating.
In Europe, the audience prefers the PlayStation 4 console, along with video game genres like action, shooters, and sports. They also tend to favor games with a Mature (M) rating. This is very similar to North America, but in Europe, the preference for the PS4 over the Xbox One is significantly greater.
Japan is build diferent, in this region the audience prefers de Nintendo 3DS, along with video games genres Role-playing, action, and fighting games. They also tend to favor games with a Teen (T) rating.
Step 5: hypotesis¶
# I avoid scores with 0, since they are missing values.
df_scores = df.loc[df["user_score"] != 0]
# Xbox user score.
xbox_user_score = df_scores[df_scores['platform'] == 'XONE']['user_score'].reset_index(drop=True)
# PC user score.
pc_user_score = df_scores[df_scores['platform'] == 'PC']['user_score'].reset_index(drop=True)
User score for Xbox and PC¶
Nule hypotesis: The mean user score is statistically the same for Xbox One and PC.
Alternate hypotesis: The mean user score is statistacally diferent for Xbox One and PC.
# The mean user score for the Xbox One and PC platforms are the same.
# Mean Scores
print(f"Mean user score for Xbox is {xbox_user_score.mean().round(2)}",
f"\nMean user score for PC is {pc_user_score.mean().round(2)}")
Mean user score for Xbox is 6.52 Mean user score for PC is 7.07
T Test 1
# significance level ("critical value")
alpha = .05
# Test with diferent variance, for two independent statistical population.
results = st.ttest_ind(xbox_user_score, pc_user_score, equal_var = False)
print('P Value:', results.pvalue)
if results.pvalue < alpha:
print("We reject the null hypothesis.")
else:
print("We fail to reject the null hypothesis.")
P Value: 4.5385802963771835e-06 We reject the null hypothesis.
User Score for Action and Sport¶
Nule hypotesis: The mean user score is statistically the same for Action and Sport video games.
Alternate hypotesis: The mean user score is statistacally diferent for Action and Sport video games.
# Action games user score
action_user_score = df_scores[df_scores['genre'] == 'action']['user_score'].reset_index(drop=True)
# Spot games user score
sport_user_score = df_scores[df_scores['genre'] == 'sports']['user_score'].reset_index(drop=True)
# Is the mean user score statistaclly diferent for Action and Sport genres?
# Mean Scores
print(f"Mean user score for Action games is {action_user_score.mean().round(2)}",
f"\nMean user score for Sport games is {sport_user_score.mean().round(2)}")
Mean user score for Action games is 7.06 Mean user score for Sport games is 6.95
T Test 2
# significance level ("critical value")
alpha = .05
# Test with diferent variance, for two independent statistical population.
results = st.ttest_ind(action_user_score, sport_user_score, equal_var = False)
print('P Value:', results.pvalue)
if results.pvalue < alpha:
print("We reject the null hypothesis.")
else:
print("We fail to reject the null hypothesis.")
P Value: 0.07370217419278526 We fail to reject the null hypothesis.
Step 6: Conclusion¶
Español
A continuación enunciaré los hallazgos de esta investigación:
En este análisis descubrí que PS2, X360, PS3, WII, DS, y PS son las plataformas más exitosas en la historia de los videojuegos (hasta 2016).
Las plataformas tienden a tener un periodo de vida de 9 a 10 años. Entre los años 2007 al 2011 fue el momento que más títulos de videojuegos fueron lanzados al mercado.
El PS4, el Xbox One y el nintendo 3DS son las consolas más exitosas a la fecha de esta investigación, por sus ventas y por estar vigentes en el mercado.
Existe una ligera correlación entre las crítica de los expertos y las ventas de los videojuegos, asimismo hay ligera correlación entre la calificiación de los usuarios y las ventas.
Los mismos títulos tienen mejores ventas en consolas como PS4 o Xbox One en comparación con su contraparte para PC. De igual manera algunos de los grandes títulos lanzados para NES en los años 80, como lo son "Super Mario bros" y el legendario "Tetris", se volvieron a lanzar para Game Boy (GB) hacia finales de los años 90, y si bien estos relanzamientos lograron algunas ventas, sus números están muy por debajo de lo que lograron en NES más de una década antes.
Definitivamente la categoría más lucrativa es "Action", y la menos lucrativa es "Strategy", sin embargo vale la pena recalcar que la mayoría de los juegos son etiquetados como "Action" por lo cual puede no ser muy acertado pensar que los juegos "Action" tienden a ser mejor recibidos, sino que al ser la categoría más predominante es la categoría que más dinero ha generado.
Sobre los perfiles de usuarios por región puedo concluir lo siguiente: El perfil para Estados Unidos es PlayStation 4 o Xbox One casi a la par, con gustos en juegos clasificación M y del genero acción, shooter, o deportes.
Europa es muy similar a Estados unidos pero la predilecció en plataforma para Europa es más fuerte hacia el PlayStation 4, también prefieren los juegos clasificación M (adultos) y de los generos acción, shooter, o deportes.
Japón es el más distinto, en esta región los jugadores prefieren la consola nintentdo 3DS, quizá por ser orginaria de ese país; prefieren los juegos clasificación T (adolescentes) y también hay una diferencia en la preferencia de los géneros, pues ellos prefieren en primer lugar los Role-playing games, seguido de los juegos de acción y los juegos de pelea.
Por último, puse a prueba la hipótesis nula "las calificaciones promedio de los usuarios para las plataformas Xbox One y PC son las mismas", y descubrí que sí existe una diferencia estadística, es muy poco probable que haya sido una variación de los datos, podemos recharzar la hipótesis nula, y tomar la alternariva indicando que son difentes poblaciones.
por el contrario cuando revisé la hipótesis "las calificaciones promedio de los usuarions para los géneros de Acción y Deportes son iguales", si establezco el umbral (alpha) en .05, como es convencional, la Prueba T me indica que no podemos rechazar la hipótesis nula, por lo tanto sugiere que estadísticamente son iguales estas poblaciones; sin embargo, el P. Value de esta prueba es muy cercano al umbral (.07), si yo moviera el alfa en vez del 5% al 10% entonces la prueba me sugeriría recharzar la hipótesis nula.
English
Below, I will outline the findings of this research:
In this analysis, I discovered that the PS2, X360, PS3, Wii, DS, and PS are the most successful platforms in video game history (up to 2016).
Platforms tend to have a lifespan of 9 to 10 years. Between 2007 and 2011 was the period when the most video game titles were released to the market.
The PS4, Xbox One, and Nintendo 3DS are the most successful consoles at the time of this research, due to their sales and continued presence in the market.
There is a slight correlation between expert reviews and video game sales. Similarly, there is a slight correlation between user ratings and sales.
The same titles tend to sell better on consoles like PS4 or Xbox One compared to their PC counterparts. Similarly, some of the great titles released for the NES in the 1980s, such as Super Mario Bros. and the legendary Tetris, were re-released for the Game Boy (GB) in the late 1990s. While these re-releases achieved some sales, their numbers are far below what they achieved on the NES more than a decade earlier.
The most lucrative category is definitely "Action," while the least lucrative is "Strategy." However, it is worth noting that most games are labeled as "Action," which makes it inaccurate to assume that "Action" games tend to be better received. Instead, being the most predominant category, it is the one that has generated the most revenue.
Regarding user profiles by region, I can conclude the following:
The profile for the United States is almost evenly split between PlayStation 4 and Xbox One, with preferences for games rated M and genres such as action, shooters, or sports. Europe is very similar to the United States, but the preference for PlayStation 4 is much stronger in this region. They also prefer M-rated games and genres like action, shooters, or sports. Japan is the most distinct. In this region, players prefer the Nintendo 3DS console, perhaps because it originates from that country. They favor T-rated games (for teens), and their genre preferences differ as well, with Role-Playing Games being the most popular, followed by action and fighting games. Finally, I tested the null hypothesis: "The average user scores for Xbox One and PC are the same," and I found that there is a statistically significant difference. It is highly unlikely that this difference is due to data variation. We can reject the null hypothesis and accept the alternative, indicating that they are different populations.
On the other hand, when I tested the hypothesis, "The average user scores for Action and Sports genres are the same," if I set the threshold (alpha) at .05, as is conventional, the T-test indicates that we cannot reject the null hypothesis. This suggests that these populations are statistically the same. However, the p-value of this test is very close to the threshold (.07). If I were to move alpha from 5% to 10%, the test would suggest rejecting the null hypothesis.